kosovawikiaorg-20200215-history
Wordnet/SenseIdx
Table of Contents NAME index.sense, sense.idx - WordNet's sense index DESCRIPTION The WordNet sense index provides an alternate method for accessing synsets and word senses in the WordNet database. It is useful to applications that retrieve synsets or other information related to a specific sense in WordNet, rather than all the senses of a word or collocation. It can also be used with tools like'grep '''and Perl to find all senses of a word in one or more parts of speech. A specific WordNet sense, encoded as a''sense_key '', can be used as an index into this file to obtain its WordNet sense number, the database byte offset of the synset containing the sense, and the number of times it has been tagged in the semantic concordance texts. Concatenating the ''lemma '' and ''lex_sense '' fields of a semantically tagged word (represented in a ' ' attribute/value pair) in a semantic concordance file, using '% ' as the concatenation character, creates the ''sense_key '' for that sense, which can in turn be used to search the sense index file. A ''sense_key '' is the best way to represent a sense in semantic tagging or other systems that refer to WordNet senses. ''sense_key ''s are independent of WordNet sense numbers and ''synset_offset ''s, which vary between versions of the database. Using the sense index and a ''sense_key '', the corresponding synset (via the ''synset_offset '') and WordNet sense number can easily be obtained. A mapping from noun ''sense_key ''s in WordNet 1.6 to corresponding 2.0 ''sense_key ''s is provided with version 2.0, and is described in 'sensemap(5WN) . See '''wndb(5WN) for a thorough discussion of the WordNet database files. File Format The sense index file lists all of the senses in the WordNet database with each line representing one sense. The file is in alphabetical order, fields are separated by one space, and each line is terminated with a newline character. Each line is of the form: sense_key '' is an encoding of the word sense. Programs can construct a sense key in this format and use it as a binary search key into the sense index file. The format of a ''sense_key '' is described below. ''synset_offset '' is the byte offset that the synset containing the sense is found at in the database "data" file corresponding to the part of speech encoded in the ''sense_key ''. ''synset_offset '' is an 8 digit, zero-filled decimal integer, and can be used with 'fseek(3) to read a synset from the data file. When passed to the WordNet library function 'read_synset() ' along with the syntactic category, a data structure containing the parsed synset is returned. sense_number '' is a decimal integer indicating the sense number of the word, within the part of speech encoded in ''sense_key '', in the WordNet database. See 'wndb(5WN) for information about how sense numbers are assigned. tag_cnt '' represents the decimal number of times the sense is tagged in various semantic concordance texts. A ''tag_cnt '' of '''0 ' indicates that the sense has not been semantically tagged. Sense Key Encoding A''sense_key is represented as: where ''lex_sense '' is encoded as: ''lemma '' is the ASCII text of the word or collocation as found in the WordNet database index file corresponding to ''pos ''. ''lemma '' is in lower case, and collocations are formed by joining individual words with an underscore ('_ ) character. ss_type '' is a one digit decimal integer representing the synset type for the sense. See below for a listing of the numbers corresponding to each synset type. ''lex_filenum '' is a two digit decimal integer representing the name of the lexicographer file containing the synset for the sense. See 'lexnames(5WN) for the list of lexicographer file names and their corresponding numbers. lex_id '' is a two digit decimal integer that, when appended onto ''lemma '', uniquely identifies a sense within a lexicographer file. ''lex_id '' numbers usually start with '''00 ', and are incremented as additional senses of the word are added to the same file, although there is no requirement that the numbers be consecutive or begin with 00 '. Note that a value of '''00 ' is the default, and therefore is not present in lexicographer files. Only non-default lex_id '' values must be explicitly assigned in lexicographer files. See 'wninput(5WN) for information on the format of lexicographer files. head_word '' is only present if the sense is in an adjective satellite synset. It is the lemma of the first word of the satellite's head synset. ''head_id '' is a two digit decimal integer that, when appended onto ''head_word '', uniquely identifies the sense of ''head_word '' within a lexicographer file, as described for ''lex_id ''. There is a value in this field only if ''head_word '' is present. Synset Type The synset type is encoded as follows: NOTES For non-satellite senses the''head_word ''and''head_id ''fields have no values, however the field separator character (': ) is present. ENVIRONMENT VARIABLES (UNIX) :Base directory for WordNet. Default is '/usr/local/WordNet-3.0 '. :Directory in which the WordNet database has been installed. Default is 'WNHOME/dict '. REGISTRY (WINDOWS) :Base directory for WordNet. Default is 'C:\Program Files\WordNet\3.0 '. FILES :sense index SEE ALSO '''binsrch(3WN),wnsearch(3WN),lexnames(5WN),wnintro(5WN),sensemap(5WN),wndb(5WN),wninput(5WN). ---- [http:// Table of Contents] *NAME *DESCRIPTION **File Format **Sense Key Encoding **Synset Type *NOTES *ENVIRONMENT VARIABLES (UNIX) *REGISTRY (WINDOWS) *FILES *SEE ALSO source http://wordnet.princeton.edu/wordnet/man/senseidx.5WN.html